Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

Authors

  • Manoj Harpalani
  • Michael Hart
  • Sandesh Singh
  • Rob Johnson
  • Yejin Choi
Abstract

Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre in which a group of people share similar linguistic behavior. Experimental results suggest that (1) statistical models provide evidence of unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context-free grammars (PCFG) discriminate vandalism more effectively than shallow lexico-syntactic patterns based on n-grams.
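To make the shallow n-gram baseline mentioned in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, add-one smoothing, and a tiny invented training set; the names `BigramModel` and `log_likelihood_ratio` are hypothetical. One bigram language model is trained per class, and an edit is scored by the difference in log-probability under the two models. A PCFG-based variant would replace the bigram counts with counts of grammar productions taken from parse trees, which requires a parser and is not shown here.

```python
import math
from collections import Counter


def bigrams(text):
    """Lowercase, split on whitespace, and pad with sentence boundary markers."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, texts):
        self.bigram_counts = Counter()
        self.context_counts = Counter()
        vocab = set()
        for text in texts:
            for first, second in bigrams(text):
                self.bigram_counts[(first, second)] += 1
                self.context_counts[first] += 1
                vocab.update((first, second))
        # Reserve one extra slot so unseen tokens get non-zero probability.
        self.vocab_size = len(vocab) + 1

    def log_prob(self, text):
        """Sum of smoothed log P(second | first) over the bigrams of text."""
        total = 0.0
        for first, second in bigrams(text):
            numerator = self.bigram_counts[(first, second)] + 1
            denominator = self.context_counts[first] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def log_likelihood_ratio(text, vandal_model, regular_model):
    """Positive values mean the text looks more like the vandalism corpus."""
    return vandal_model.log_prob(text) - regular_model.log_prob(text)


# Toy corpora invented for illustration only; they are not from the paper.
vandal_model = BigramModel([
    "you are so dumb lol",
    "this page is so dumb",
    "lol lol lol",
])
regular_model = BigramModel([
    "the city is located in the north",
    "the river flows through the city",
])

if __name__ == "__main__":
    for edit in ["so dumb lol", "the city is located"]:
        print(edit, log_likelihood_ratio(edit, vandal_model, regular_model))
```

In practice such a ratio would be one feature among many fed to a classifier, rather than a decision rule on its own.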

Similar resources

Detecting Vandalism on Wikipedia across Multiple Languages

Vandalism, the malicious modification or editing of articles, is a serious problem for free and open access online encyclopedias such as Wikipedia. Over the 13-year lifetime of Wikipedia, editors have identified and repaired vandalism in 1.6% of more than 500 million revisions of over 9 million English articles, but smaller manually inspected sets of revisions for research show vandalism may ap...

Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features

Wikipedia is an online encyclopedia which anyone can edit. While most edits are constructive, about 7% are acts of vandalism. Such behavior is characterized by modifications made in bad faith; introducing spam and other inappropriate content. In this work, we present the results of an effort to integrate three of the leading approaches to Wikipedia vandalism detection: a spatio-temporal analysi...

Using Language Models to Detect Wikipedia Vandalism

This paper explores a statistical language modeling approach for detecting Wikipedia vandalism. Wikipedia is a popular and influential collaborative information system. The collaborative nature of authoring, as well as the high visibility of its content, have exposed Wikipedia articles to vandalism, defined as malicious editing intended to compromise the integrity of the content of articles. Ex...

Enriching Wikipedia Vandalism Taxonomy via Subclass Discovery

This paper adopts an unsupervised subclass discovery approach to automatically improve the taxonomy of Wikipedia vandalism. Wikipedia vandalism, defined as malicious editing intended to compromise the integrity of the content of articles, exhibits heterogeneous characteristics, making it hard to detect automatically. The categorization of vandalism provides insights on the detection of vandalis...

"Got You!": Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling

Discriminating vandalism edits from non-vandalism edits in Wikipedia is a challenging task, as ill-intentioned edits can include a variety of content and be expressed in many different forms and styles. Previous studies are limited to rule-based methods and learning based on lexical features, lacking in linguistic analysis. In this paper, we propose a novel Web-based shallow syntactic-semantic m...

Publication date: 2011